{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Pima Indians Diabetes Data Set数据探索\n",
    "\n",
    "数据说明：\n",
    "Pima Indians Diabetes Data Set（皮马印第安人糖尿病数据集） 根据现有的医疗信息预测5年内皮马印第安人糖尿病发作的概率。   \n",
    "\n",
    "数据集共9个字段: \n",
    "0列为怀孕次数；\n",
    "1列为口服葡萄糖耐量试验中2小时后的血浆葡萄糖浓度；\n",
    "2列为舒张压（单位:mm Hg）\n",
    "3列为三头肌皮褶厚度（单位：mm）\n",
    "4列为餐后血清胰岛素（单位:mm）\n",
    "5列为体重指数（体重（公斤）/ 身高（米）^2）\n",
    "6列为糖尿病家系作用\n",
    "7列为年龄\n",
    "8列为分类变量（0或1）\n",
    "\n",
    "数据链接：https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "import必要的工具包，用于文件读取／特征编码"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据文件路径和文件名"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pregnants</th>\n",
       "      <th>Plasma_glucose_concentration</th>\n",
       "      <th>blood_pressure</th>\n",
       "      <th>Triceps_skin_fold_thickness</th>\n",
       "      <th>serum_insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>Diabetes_pedigree_function</th>\n",
       "      <th>Age</th>\n",
       "      <th>Target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>6</td>\n",
       "      <td>148</td>\n",
       "      <td>72</td>\n",
       "      <td>35</td>\n",
       "      <td>0</td>\n",
       "      <td>33.6</td>\n",
       "      <td>0.627</td>\n",
       "      <td>50</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>85</td>\n",
       "      <td>66</td>\n",
       "      <td>29</td>\n",
       "      <td>0</td>\n",
       "      <td>26.6</td>\n",
       "      <td>0.351</td>\n",
       "      <td>31</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>183</td>\n",
       "      <td>64</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>23.3</td>\n",
       "      <td>0.672</td>\n",
       "      <td>32</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>89</td>\n",
       "      <td>66</td>\n",
       "      <td>23</td>\n",
       "      <td>94</td>\n",
       "      <td>28.1</td>\n",
       "      <td>0.167</td>\n",
       "      <td>21</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>137</td>\n",
       "      <td>40</td>\n",
       "      <td>35</td>\n",
       "      <td>168</td>\n",
       "      <td>43.1</td>\n",
       "      <td>2.288</td>\n",
       "      <td>33</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pregnants  Plasma_glucose_concentration  blood_pressure  \\\n",
       "0          6                           148              72   \n",
       "1          1                            85              66   \n",
       "2          8                           183              64   \n",
       "3          1                            89              66   \n",
       "4          0                           137              40   \n",
       "\n",
       "   Triceps_skin_fold_thickness  serum_insulin   BMI  \\\n",
       "0                           35              0  33.6   \n",
       "1                           29              0  26.6   \n",
       "2                            0              0  23.3   \n",
       "3                           23             94  28.1   \n",
       "4                           35            168  43.1   \n",
       "\n",
       "   Diabetes_pedigree_function  Age  Target  \n",
       "0                       0.627   50       1  \n",
       "1                       0.351   31       0  \n",
       "2                       0.672   32       1  \n",
       "3                       0.167   21       0  \n",
       "4                       2.288   33       1  "
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#input data\n",
    "train = pd.read_csv(\"pima-indians-diabetes.csv\")\n",
    "train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#train.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "粗看数据集没有缺失值\n",
    "但该数据集已知存在缺失值，某些列中存在的缺失值被标记为0。通过这些列中指标的定义和相应领域的常识可以证实上述观点，譬如体重指数和血压两列中的0作为指标数值来说是无意义的。"
   ]
  },
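  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One way to check the claim above is to count the zeros in each column. The following is a minimal sketch on a small made-up DataFrame (the name `demo` and its values are hypothetical; only the column names mirror the real data), not the notebook's own analysis.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data; the column names mirror the real dataset.\n",
    "demo = pd.DataFrame({\n",
    "    'blood_pressure': [72, 0, 64, 66],\n",
    "    'BMI': [33.6, 26.6, 0.0, 0.0],\n",
    "    'Age': [50, 31, 32, 21],\n",
    "})\n",
    "\n",
    "# Compare every cell with 0, then count the True values per column.\n",
    "zero_counts = (demo == 0).sum()\n",
    "print(zero_counts)\n",
    "```\n",
    "\n",
    "Columns where a zero is physically impossible and the count is nonzero are candidates for zero-encoded missing values."
   ]
  },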
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pregnants</th>\n",
       "      <th>Plasma_glucose_concentration</th>\n",
       "      <th>blood_pressure</th>\n",
       "      <th>Triceps_skin_fold_thickness</th>\n",
       "      <th>serum_insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>Diabetes_pedigree_function</th>\n",
       "      <th>Age</th>\n",
       "      <th>Target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "      <td>768.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>3.845052</td>\n",
       "      <td>120.894531</td>\n",
       "      <td>69.105469</td>\n",
       "      <td>20.536458</td>\n",
       "      <td>79.799479</td>\n",
       "      <td>31.992578</td>\n",
       "      <td>0.471876</td>\n",
       "      <td>33.240885</td>\n",
       "      <td>0.348958</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>3.369578</td>\n",
       "      <td>31.972618</td>\n",
       "      <td>19.355807</td>\n",
       "      <td>15.952218</td>\n",
       "      <td>115.244002</td>\n",
       "      <td>7.884160</td>\n",
       "      <td>0.331329</td>\n",
       "      <td>11.760232</td>\n",
       "      <td>0.476951</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.078000</td>\n",
       "      <td>21.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1.000000</td>\n",
       "      <td>99.000000</td>\n",
       "      <td>62.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>27.300000</td>\n",
       "      <td>0.243750</td>\n",
       "      <td>24.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>117.000000</td>\n",
       "      <td>72.000000</td>\n",
       "      <td>23.000000</td>\n",
       "      <td>30.500000</td>\n",
       "      <td>32.000000</td>\n",
       "      <td>0.372500</td>\n",
       "      <td>29.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>6.000000</td>\n",
       "      <td>140.250000</td>\n",
       "      <td>80.000000</td>\n",
       "      <td>32.000000</td>\n",
       "      <td>127.250000</td>\n",
       "      <td>36.600000</td>\n",
       "      <td>0.626250</td>\n",
       "      <td>41.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>17.000000</td>\n",
       "      <td>199.000000</td>\n",
       "      <td>122.000000</td>\n",
       "      <td>99.000000</td>\n",
       "      <td>846.000000</td>\n",
       "      <td>67.100000</td>\n",
       "      <td>2.420000</td>\n",
       "      <td>81.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        pregnants  Plasma_glucose_concentration  blood_pressure  \\\n",
       "count  768.000000                    768.000000      768.000000   \n",
       "mean     3.845052                    120.894531       69.105469   \n",
       "std      3.369578                     31.972618       19.355807   \n",
       "min      0.000000                      0.000000        0.000000   \n",
       "25%      1.000000                     99.000000       62.000000   \n",
       "50%      3.000000                    117.000000       72.000000   \n",
       "75%      6.000000                    140.250000       80.000000   \n",
       "max     17.000000                    199.000000      122.000000   \n",
       "\n",
       "       Triceps_skin_fold_thickness  serum_insulin         BMI  \\\n",
       "count                   768.000000     768.000000  768.000000   \n",
       "mean                     20.536458      79.799479   31.992578   \n",
       "std                      15.952218     115.244002    7.884160   \n",
       "min                       0.000000       0.000000    0.000000   \n",
       "25%                       0.000000       0.000000   27.300000   \n",
       "50%                      23.000000      30.500000   32.000000   \n",
       "75%                      32.000000     127.250000   36.600000   \n",
       "max                      99.000000     846.000000   67.100000   \n",
       "\n",
       "       Diabetes_pedigree_function         Age      Target  \n",
       "count                  768.000000  768.000000  768.000000  \n",
       "mean                     0.471876   33.240885    0.348958  \n",
       "std                      0.331329   11.760232    0.476951  \n",
       "min                      0.078000   21.000000    0.000000  \n",
       "25%                      0.243750   24.000000    0.000000  \n",
       "50%                      0.372500   29.000000    0.000000  \n",
       "75%                      0.626250   41.000000    1.000000  \n",
       "max                      2.420000   81.000000    1.000000  "
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#查看数值型特征的基本统计量\n",
    "train.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "从结果中我们可以看到很多列的最小值为0。而在一些特定列代表的变量中，0值并没有意义，这就表名该值无效或为缺失值。\n",
    "\n",
    "具体来说，下列变量的最小值为0时数据无意义：\n",
    "1、血浆葡萄糖浓度\n",
    "2、舒张压\n",
    "3、肱三头肌皮褶厚度\n",
    "4、餐后血清胰岛素\n",
    "5、体重指数\n",
    "\n",
    "在Pandas的DataFrame中，通过replace()函数可以很方便的将我们感兴趣的数据子集的值标记为NaN。\n",
    "\n",
    "标记完缺失值之后，可以利用isnull()函数将数据集中所有的NaN值标记为True，然后就可以得到每一列中缺失值的数量了。"
   ]
  },
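  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The replace-then-count procedure described above can be sketched on toy data. This is an illustration under assumptions: `demo` and `zero_is_missing` are hypothetical names, and the values are made up. It also reports the share of missing rows, which is often easier to judge than a raw count.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data; the column names mirror the real dataset.\n",
    "demo = pd.DataFrame({\n",
    "    'pregnants': [6, 0, 8, 0],\n",
    "    'blood_pressure': [72, 0, 64, 66],\n",
    "    'BMI': [33.6, 26.6, 0.0, 0.0],\n",
    "})\n",
    "\n",
    "# Zero is only invalid for these columns; 0 pregnancies is a real value.\n",
    "zero_is_missing = ['blood_pressure', 'BMI']\n",
    "demo[zero_is_missing] = demo[zero_is_missing].replace(0, np.nan)\n",
    "\n",
    "missing_counts = demo.isnull().sum()     # number of NaN values per column\n",
    "missing_fraction = demo.isnull().mean()  # share of rows that are NaN\n",
    "print(missing_counts)\n",
    "print(missing_fraction)\n",
    "```\n",
    "\n",
    "Restricting replace() to a list of columns keeps legitimate zeros, such as zero pregnancies, untouched."
   ]
  },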
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pregnants                         0\n",
      "Plasma_glucose_concentration      5\n",
      "blood_pressure                   35\n",
      "Triceps_skin_fold_thickness     227\n",
      "serum_insulin                   374\n",
      "BMI                              11\n",
      "Diabetes_pedigree_function        0\n",
      "Age                               0\n",
      "Target                            0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
    "NaN_col_names = ['Plasma_glucose_concentration','blood_pressure','Triceps_skin_fold_thickness','serum_insulin','BMI']\n",
    "train[NaN_col_names] = train[NaN_col_names].replace(0, np.NaN)\n",
    "print(train.isnull().sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 对缺失值较多的特征，新增一个特征，表示这个特征是否缺失"
   ]
  },
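  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A missing-value indicator can also be built without a row-by-row function. The sketch below is one possible vectorized form on a made-up column (`demo` and its values are hypothetical); it is offered as an alternative, not as the method this notebook uses.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy column with two missing values.\n",
    "demo = pd.DataFrame({'serum_insulin': [np.nan, 94.0, 168.0, np.nan]})\n",
    "\n",
    "# isnull() returns booleans; astype(int) turns them into 0/1 flags.\n",
    "demo['serum_insulin_Missing'] = demo['serum_insulin'].isnull().astype(int)\n",
    "print(demo)\n",
    "```\n",
    "\n",
    "The indicator keeps the information that a value was absent even if the original column is later imputed."
   ]
  },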
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Triceps_skin_fold_thickness</th>\n",
       "      <th>Triceps_skin_fold_thickness_Missing</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>29.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>23.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>32.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>45.0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   Triceps_skin_fold_thickness  Triceps_skin_fold_thickness_Missing\n",
       "0                         35.0                                    0\n",
       "1                         29.0                                    0\n",
       "2                          NaN                                    1\n",
       "3                         23.0                                    0\n",
       "4                         35.0                                    0\n",
       "5                          NaN                                    1\n",
       "6                         32.0                                    0\n",
       "7                          NaN                                    1\n",
       "8                         45.0                                    0\n",
       "9                          NaN                                    1"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#缺失值比较多，干脆就开一个新的字段，表明是缺失值还是不是缺失值\n",
    "train['Triceps_skin_fold_thickness_Missing'] = train['Triceps_skin_fold_thickness'].apply(lambda x: 1 if pd.isnull(x) else 0)\n",
    "train[['Triceps_skin_fold_thickness','Triceps_skin_fold_thickness_Missing']].head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x21e2812be48>"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEHCAYAAABBW1qbAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAahUlEQVR4nO3df5AV9Z3u8ffjAIIRo8DgAgNCFLkLugxxxMRoVoVauGZVSKIFNz9QNJgqkphNZFdTKUWUqiSbqMkadTEaMTEg10RFN2rEjbJm1XEkiIBLidELAyiIP1YWURk/94/uaY7DmeEMTJ8zMM+r6tSc8+1vd3/6DJxnvt19uhURmJmZARxU6QLMzKzzcCiYmVnGoWBmZhmHgpmZZRwKZmaW6VbpAvZFv379YujQoZUuw8xsv/Lss8++HhHVxabt16EwdOhQGhoaKl2Gmdl+RdL/a22adx+ZmVnGoWBmZhmHgpmZZfbrYwpmZpXywQcf0NjYyI4dOypdSqt69uxJTU0N3bt3L3keh4KZ2V5obGykd+/eDB06FEmVLmc3EcHWrVtpbGxk2LBhJc/n3UdmZnthx44d9O3bt1MGAoAk+vbt2+6RjEPBzGwvddZAaLY39TkUzMws42MKZmYdZOvWrYwbNw6AV199laqqKqqrky8O19fX06NHjw5f57Jly9i8eTMTJ07skOV1+VA4YdYdlS6h03j2n79a6RLM9mt9+/Zl+fLlAMyePZtDDz2USy+9tOT5m5qaqKqqatc6ly1bxsqVKzssFLz7yMysDM466yxOOOEERo0axS9+8QsAdu7cyeGHH873v/99xo4dS319PYsXL2bEiBGceuqpfPOb32TSpEkAbNu2jfPPP5+xY8cyZswY7r//ft59913mzJnDnXfeSW1tLXffffc+19nlRwpmZuUwf/58+vTpw/bt26mrq+MLX/gCvXv35u233+aTn/wk11xzDdu3b+fYY4/lT3/6E0OGDOG8887L5p8zZw4TJ07k9ttv58033+Skk05ixYoVXHHFFaxcuZLrr7++Q+rMbaQgqaekeknPSVol6aq0fbakDZKWp48zC+a5XNJaSWskTcirNjOzcrvuuusYPXo0n/70p2lsbOSll14CoEePHkyePBmA1atXM2LECI466igkMXXq1Gz+P/zhD8ydO5fa2lpOP/10duzYwbp16zq8zjxHCu8BZ0TENkndgSckPZhOuy4iflzYWdJIYAowChgILJF0bEQ05VijmVnulixZwtKlS3nqqafo1asXp5xySvb9gV69emWnjkZEq8uICO69916OPvroj7QvXbq0Q2vNbaQQiW3py+7po/UthnOAhRHxXkS8DKwFxuZVn5lZubz99tv06dOHXr16sWrVKp555pmi/UaNGsWaNWtYv349EcFdd92VTZswYQI/+9nPstd//vOfAejduzfvvPNOh9Wa64FmSVWSlgObgUci4ul00jckrZB0m6Qj0rZBwPqC2RvTtpbLnCGpQVLDli1b8izfzKxDfO5zn2P79u2MHj2aOXPmcNJJJxXtd8ghh3DDDTcwfvx4Tj31VAYOHMjHP/5xAK688kq2b9/O8ccfz6hRo5g9ezYAZ5xxBs899xxjxozp/Aea010/tZIOB+6RdBxwE3A1yajhauAnwHSg2FfvdhtZRMQ8YB5AXV1dWyMPM7OKaf7QhuTCdA8//HDRfm+99dZHXo8fP541a9YQEVx88cXU1dUB8LGPfYxbbrllt/mrq6s79GZjZTklNSLeAh4DJkbEaxHRFBEfArewaxdRIzC4YLYaYGM56jMz6yxuuukmamtrGTlyJO+++y5f+9rXyrr+3EYKkqqBDyLiLUm9gPHADyUNiIhNabfJwMr0+WLgN5KuJTnQPByoz6s+M7POaNasWcyaNati689z99EAYL6kKpIRyaKIeEDSryTVkuwaegW4GCAiVklaBKwG
dgIzfeaRmVl55RYKEbECGFOk/SttzDMXmJtXTWZm1jZf5sLMzDIOBTMzy/jaR2ZmHaCjr7hcylWLH3roIS655BKampq46KKLuOyyy/Z5vR4pmJnth5qampg5cyYPPvggq1evZsGCBaxevXqfl+tQMDPbD9XX13PMMcfwiU98gh49ejBlyhTuu+++fV6uQ8HMbD+0YcMGBg/e9X3fmpoaNmzYsM/LdSiYme2Hil1Rtflqq/vCoWBmth+qqalh/fpd1xBtbGxk4MCB+7xch4KZ2X7oxBNP5MUXX+Tll1/m/fffZ+HChZx99tn7vFyfkmpm1gFKOYW0I3Xr1o0bbriBCRMm0NTUxPTp0xk1atS+L7cDajMzswo488wzOfPMM/fcsR28+8jMzDIOBTMzyzgUzMws41AwM7OMQ8HMzDIOBTMzy/iUVDOzDrBuzvEdurwhVzy/xz7Tp0/ngQceoH///qxcuXKP/UvhkYKZ2X7q/PPP56GHHurQZToUzMz2U5/97Gfp06dPhy4zt1CQ1FNSvaTnJK2SdFXa3kfSI5JeTH8eUTDP5ZLWSlojaUJetZmZWXF5jhTeA86IiNFALTBR0qeAy4BHI2I48Gj6GkkjgSnAKGAicKOkqhzrMzOzFnILhUhsS192Tx8BnAPMT9vnA5PS5+cACyPivYh4GVgLjM2rPjMz212uxxQkVUlaDmwGHomIp4EjI2ITQPqzf9p9ELC+YPbGtK3lMmdIapDUsGXLljzLNzPrcnI9JTUimoBaSYcD90g6ro3uxW4ZtNuthSJiHjAPoK6ubvdbD5mZVUApp5B2tKlTp/LYY4/x+uuvU1NTw1VXXcWFF164T8ssy/cUIuItSY+RHCt4TdKAiNgkaQDJKAKSkcHggtlqgI3lqM/MbH+0YMGCDl9mnmcfVacjBCT1AsYD/wUsBqal3aYB96XPFwNTJB0saRgwHKjPqz4zM9tdniOFAcD89Ayig4BFEfGApCeBRZIuBNYB5wJExCpJi4DVwE5gZrr7yczMyiS3UIiIFcCYIu1bgXGtzDMXmJtXTWZmHSkikIodDu0cItp/2NXfaDYz2ws9e/Zk69ate/XBWw4RwdatW+nZs2e75vMF8czM9kJNTQ2NjY105lPje/bsSU1NTbvmcSiYme2F7t27M2zYsEqX0eG8+8jMzDIOBTMzyzgUzMws41AwM7OMQ8HMzDIOBTMzyzgUzMws41AwM7OMQ8HMzDIOBTMzyzgUzMws41AwM7OMQ8HMzDIOBTMzyzgUzMws41AwM7NMbqEgabCkP0p6QdIqSZek7bMlbZC0PH2cWTDP5ZLWSlojaUJetZmZWXF53nltJ/DdiFgmqTfwrKRH0mnXRcSPCztLGglMAUYBA4Elko6NiKYcazQzswK5jRQiYlNELEufvwO8AAxqY5ZzgIUR8V5EvAysBcbmVZ+Zme2uLMcUJA0FxgBPp03fkLRC0m2SjkjbBgHrC2ZrpEiISJohqUFSQ2e+YbaZ2f4o91CQdCjwW+DbEfHfwE3A0UAtsAn4SXPXIrPHbg0R8yKiLiLqqqurc6razKxryjUUJHUnCYQ7I+J3ABHxWkQ0RcSHwC3s2kXUCAwumL0G2JhnfWZm9lF5nn0k4FbghYi4tqB9QEG3ycDK9PliYIqkgyUNA4YD9XnVZ2Zmu8vz7KPPAF8Bnpe0PG37HjBVUi3JrqFXgIsBImKVpEXAapIzl2b6zCMzs/LKLRQi4gmKHyf4fRvzzAXm5lWTmZm1zd9oNjOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwyJYWCpEdLaTMzs/1bm/doltQTOAToJ+kIdt1z+TBgYM61mZlZme1ppHAx8Czwv9KfzY/7gJ+3NaOkwZL+KOkFSaskXZK295H0iKQX059HFMxzuaS1
ktZImrAvG2ZmZu3XZihExE8jYhhwaUR8IiKGpY/REXHDHpa9E/huRPw18ClgpqSRwGXAoxExHHg0fU06bQowCpgI3Cipap+2zszM2qXN3UfNIuJfJJ0MDC2cJyLuaGOeTcCm9Pk7kl4ABgHnAKel3eYDjwH/lLYvjIj3gJclrQXGAk+2a4vMzGyvlRQKkn4FHA0sB5rS5gBaDYUW8w8FxgBPA0emgUFEbJLUP+02CHiqYLbGtK3lsmYAMwCGDBlSyurNzKxEJYUCUAeMjIho7wokHQr8Fvh2RPy3pFa7FmnbbX0RMQ+YB1BXV9fueszMrHWlfk9hJfBX7V24pO4kgXBnRPwubX5N0oB0+gBgc9reCAwumL0G2NjedZqZ2d4rNRT6AaslPSxpcfOjrRmUDAluBV6IiGsLJi0GpqXPp5GcydTcPkXSwZKGAcOB+lI3xMzM9l2pu49m78WyPwN8BXhe0vK07XvAD4BFki4E1gHnAkTEKkmLgNUkZy7NjIim3RdrZmZ5KfXso8fbu+CIeILixwkAxrUyz1xgbnvXZWZmHaPUs4/eYddB3x5Ad+B/IuKwvAozM7PyK3Wk0LvwtaRJJN8hMDOzA8heXSU1Iu4FzujgWszMrMJK3X30+YKXB5F8b8HfETAzO8CUevbRWQXPdwKvkFyWwszMDiClHlO4IO9CzMys8krdfVQD/AvJdw8CeAK4JCIac6zNymzdnOMrXUKnMeSK5ytdgllFlHqg+Zck3zgeSHKRuvvTNjMzO4CUGgrVEfHLiNiZPm4HqnOsy8zMKqDUUHhd0pclVaWPLwNb8yzMzMzKr9RQmA6cB7xKcuOcLwI++GxmdoAp9ZTUq4FpEfEmJPdZBn5MEhZmZnaAKHWk8DfNgQAQEW+Q3EnNzMwOIKWGwkGSjmh+kY4USh1lmJnZfqLUD/afAP8p6W6S7ymchy9xbWZ2wCn1G813SGoguQiegM9HxOpcKzMzs7IreRdQGgIOAjOzA9heXTrbzMwOTA4FMzPLOBTMzCyTWyhIuk3SZkkrC9pmS9ogaXn6OLNg2uWS1kpaI2lCXnWZmVnr8hwp3A5MLNJ+XUTUpo/fA0gaCUwBRqXz3CipKsfazMysiNxCISKWAm+U2P0cYGFEvBcRLwNrgbF51WZmZsVV4pjCNyStSHcvNX9LehCwvqBPY9q2G0kzJDVIatiyZUvetZqZdSnlDoWbgKOBWpKrrf4kbVeRvlFsARExLyLqIqKuutq3dDAz60hlDYWIeC0imiLiQ+AWdu0iagQGF3StATaWszYzMytzKEgaUPByMtB8ZtJiYIqkgyUNA4YD9eWszczMcrzSqaQFwGlAP0mNwJXAaZJqSXYNvQJcDBARqyQtIrmMxk5gZkQ05VWbmZkVl1soRMTUIs23ttF/Lr7yqplZRfmeCGad1Amz7qh0CZ3Gs//81UqX0GX4MhdmZpZxKJiZWcahYGZmGYeCmZllHApmZpZxKJiZWcahYGZmGYeCmZllHApmZpZxKJiZWcahYGZmGYeCmZllHApmZpZxKJiZWcahYGZmGYeCmZllHApmZpZxKJiZWSa3UJB0m6TNklYWtPWR9IikF9OfRxRMu1zSWklrJE3Iqy4zM2tdniOF24GJLdouAx6NiOHAo+lrJI0EpgCj0nlulFSVY21mZlZEbqEQEUuBN1o0nwPMT5/PByYVtC+MiPci4mVgLTA2r9rMzKy4ch9TODIiNgGkP/un7YOA9QX9GtM2MzMro85yoFlF2qJoR2mGpAZJDVu2bMm5LDOzrqXcofCapAEA6c/NaXsjMLigXw2wsdgCImJeRNRFRF11dXWuxZqZdTXlDoXFwLT0+TTgvoL2KZIOljQMGA7Ul7k2M7Mur1teC5a0ADgN6CepEbgS+AGwSNKFwDrgXICIWCVpEbAa2AnMjIimvGozM7PicguFiJjayqRxrfSfC8zNqx4zM9uzznKg2czMOoHcRgpmZh1l3ZzjK11CpzHk
iudzXb5HCmZmlnEomJlZxqFgZmYZh4KZmWUcCmZmlnEomJlZxqFgZmYZh4KZmWUcCmZmlnEomJlZxqFgZmYZh4KZmWUcCmZmlnEomJlZxqFgZmYZh4KZmWUcCmZmlnEomJlZpiK345T0CvAO0ATsjIg6SX2Au4ChwCvAeRHxZiXqMzPrqio5Ujg9Imojoi59fRnwaEQMBx5NX5uZWRl1pt1H5wDz0+fzgUkVrMXMrEuqVCgE8AdJz0qakbYdGRGbANKf/YvNKGmGpAZJDVu2bClTuWZmXUNFjikAn4mIjZL6A49I+q9SZ4yIecA8gLq6usirQDOzrqgiI4WI2Jj+3AzcA4wFXpM0ACD9ubkStZmZdWVlDwVJH5PUu/k58HfASmAxMC3tNg24r9y1mZl1dZXYfXQkcI+k5vX/JiIekvQMsEjShcA64NwK1GZm1qWVPRQi4i/A6CLtW4Fx5a7HzMx26UynpJqZWYU5FMzMLONQMDOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwyDgUzM8s4FMzMLONQMDOzjEPBzMwynS4UJE2UtEbSWkmXVboeM7OupFOFgqQq4OfA/wZGAlMljaxsVWZmXUenCgVgLLA2Iv4SEe8DC4FzKlyTmVmX0a3SBbQwCFhf8LoROKmwg6QZwIz05TZJa8pU2wHvKOgHvF7pOjqFK1XpCqyA/20W6Jh/m0e1NqGzhUKxrY2PvIiYB8wrTzldi6SGiKirdB1mLfnfZvl0tt1HjcDggtc1wMYK1WJm1uV0tlB4BhguaZikHsAUYHGFazIz6zI61e6jiNgp6RvAw0AVcFtErKpwWV2Jd8tZZ+V/m2WiiNhzLzMz6xI62+4jMzOrIIeCmZllHArmS4tYpyXpNkmbJa2sdC1dhUOhi/OlRayTux2YWOkiuhKHgvnSItZpRcRS4I1K19GVOBSs2KVFBlWoFjOrMIeC7fHSImbWdTgUzJcWMbOMQ8F8aREzyzgUuriI2Ak0X1rkBWCRLy1inYWkBcCTwAhJjZIurHRNBzpf5sLMzDIeKZiZWcahYGZmGYeCmZllHApmZpZxKJiZWcahYGZmGYdCFySpr6Tl6eNVSRsKXvdo0fdhSb0rVWtLkp6QVFukfa/qlDRS0nOS/ixpaCt9ukl6q5Vpv5Y0qY3lf0dSzxKWM1PSl9pYznhJ97a1LeUg6SJJIelvC9rOTdsmpa9/KWlEO5c7WdKsjq7X2q9T3aPZyiMitgK1AJJmA9si4seFfSSJ5HssE8pfYfvtQ52fB+6OiKs7sp4C3wFuA3a01Skifp7T+vPwPDAVeDx9PQV4rnliRFzQ3gVGxD0dU5rtK48ULCPpGEkrJd0MLAMGpN8iPTydfoGkFelf1r9M246U9DtJDZLqJX0qbb9G0nxJf5T0oqTpafug9K/95em6Tm6llm6SfiXp+bTft1pMr0r/Sp+dvm6UdHjBNtwqaZWkB5v/Ui+yjrNJvs39dUlL0rZ/TOdfKembReY5SNKNklZLuh/o18b7+Q9Af+A/mpeftv8gfQ+flNS/4P36dvr8WEn/nvZZ1nIEI+mk5vZ0vlslPS7pL5JmFvSblv5Olqc1H9Ta+yrpH9Jtek7Sr1vbptRjwMnpsg4DhgDZTXCaR3PtWVc6Ark+ff5rST+V9J/pNk1O26sk3Zz+Xu+X9JDaGKXZ3vFIwVoaCVwQEV8HSAYMIGk08E/AyRHxhqQ+af+fAT+KiKfSD68HgOPSaccDJwOHAcsk/RvwZeD+iPihkhv89GqljhOAfhFxfLr+wwumdQN+AyyLiB8WmXcEMDUinpf0O2ASyX0iPiIiFksaC7weEdenz79Eco+JKqBe0uPA6oLZvggMS7dxYDrt5mIbEBHXSfoucGpEvCWpG/Bx4PGIuEzStcB04ActZl0AzI6I+9NAOwg4Jn0fTgWuA86O
iMb093MsMA44HHhBSaj/NTCZ5Pe1U9I8kr/oX2rlff1H4KiIeL/Fe13MhyTBMB44Erg3XV9Lrf0OS1lXf+AzJP+GFgH3AOeSXNb9eOCvSC7LUvS9t73nkYK19FJEPFOk/Qzgroh4A6D5J8kHw82SlpN8OBwhqfmD/t6I2BERm4GlwIkkF+C7SNKVwHERsa2VOtaSXO/mp5ImAG8XTLuV1gMBkpsGPZ8+fxYYuodtbnYq8NuI2B4R76Tbc0qLPp8FFkTEhxHRSPLh2B7vRsSDrdUm6QiSD9L7AdL3b3s6+TjgRuDv03U3eyAi3k/f5zeAapLfy4lAQ/q7+VvgaFp/X1cBv1ZyXOODErZjIUnITKFI4Kb2ZV33RmIFu+7vcQrJtbk+jIiN7Np9ZR3IoWAt/U8r7aL4fRYEjI2I2vQxKCLeTae17B8R8e/AacAm4E61cnA1Pe7xN8ATwLeAfy2Y/CdgnKSDW6n1vYLnTZQ+Ii52b4mi5ZXYr5j3C563Vltry9+Yzt/yQHux7RVwW8HvZUREXN3G+zqB5K/usSRBUrWH7XgS+CRwWES8VKzDPq6rcJvU4qflyKFgpVoCTGnebVSw+2gJULgfu/ADa5KkgyX1I/krvEHSUcCrETGP5P67Y4qtTFI1yYHu/wtcSfIB1Gxeut6F6S6ZjrIUmCypl6RDSW5L+h9F+kxJ988PIvkLvC3vACWfFRURbwKvSzoLQFJPSYekk98A/h74UbobqS1LgPPS9775jLMhxd7X9EO5Jg3sWSQjjUNaW3BaZwCXA99rrU9HravAE8AXlRhAMmqzDuZjClaSiFgh6UfAUkk7SXZ9XEgSCDdJuoDk39Mf2RUSzwAPktzE58qIeE3JAefvSPoA2EZyjKGYwcCtSnaaB8nxjMJ6fiRpLnC7pK920DbWK7lUc/Pus5vS4xKF/0/uBk4nObC6hiQk2jIPWCJpPaXfgP5LwL+m2/c+8IWCGjcpOUD++7a2O637qnTdB5Hspvk6yUii5fvaDfiNklN6DwJ+mO4+a1NE/NseuhT7HRZdV/Oxqz1YRLIbs/m9f5qP7la0DuBLZ1suJF1DegC30rXYgUPSoRGxLR2FPA2cFBFbKl3XgcQjBTPbnzyYngbbnWT06UDoYB4pWMVJamD3P1D+T0SsLtZ/L9dxM/CpFs3XRsQdHbT8xSTn6xe6NCKWFOvf2Um6iOQ7HIWWRsS3ivW3A4dDwczMMj77yMzMMg4FMzPLOBTMzCzjUDAzs8z/Bw4K5RC/ejImAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "#color = sns.color_palette()\n",
    "\n",
    "%matplotlib inline\n",
    "sns.countplot(x=\"Triceps_skin_fold_thickness_Missing\", hue=\"Target\",data=train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<matplotlib.axes._subplots.AxesSubplot at 0x21e2817fcc8>"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYUAAAEHCAYAAABBW1qbAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4xLjMsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+AADFEAAAVU0lEQVR4nO3de7CcdZ3n8ffHEAxCVg0EFzhAAiK7iZggIayrOCrskMWVixcGanVgQXFm0cKpgRp0LYgZs+vWeBuH0S1ABtxVLusNdC0QKZXSHQ0BERKYKApLDiCEiAgTAiR+94/z5KFJTpJOcvr0OTnvV1VX9/Pr5/d7vt116nz6ufSvU1VIkgTwon4XIEkaOwwFSVLLUJAktQwFSVLLUJAktXbpdwE7Yq+99qoZM2b0uwxJGlduu+22x6pq+nDPjetQmDFjBkuXLu13GZI0riT5f5t7zsNHkqSWoSBJahkKkqTWuD6nIEn98txzzzE4OMjatWv7XcpmTZkyhYGBASZPntx1H0NBkrbD4OAgU6dOZcaMGSTpdzmbqCpWr17N4OAgM2fO7Lqfh48kaTusXbuWPffcc0wGAkAS9txzz23ekzEUJGk7jdVA2GB76jMUJEktzylI0ghZvXo1xxxzDAC/+c1vmDRpEtOnD31xeMmSJey6664jvs3bb7+dRx99lAULFozIeBM+FI44/0v9LmHMuO1v/rTfJUjj2p577skdd9wBwMKFC9ljjz0477zzuu6/fv16Jk2atE3bvP3221m2bNmIhYKHjyRpFLztbW/jiCOOYPbs2Vx22WUArFu3jpe97GV89KMfZf78+SxZsoTrr7+eQw89lKOPPpoPfvCDnHTSSQA89dRTnHHGGcyfP5/DDz+cb33rWzz99NMsWrSIL3/5y8ydO5evfvWrO1znhN9TkKTRcOWVVzJt2jTWrFnDvHnzeMc73sHUqVN54okneO1rX8vHP/5x1qxZw6te9Sp+/OMfc8ABB3DKKae0/RctWsSCBQu44oorePzxxznqqKO48847ufDCC1m2bBmf/exnR6RO9xQkaRR85jOfYc6cObzuda9jcHCQX/3qVwDsuuuunHzyyQDcfffdHHrooRx44IEk4bTTTmv7f/e732Xx4sXMnTuXN7/5zaxdu5YHHnhgxOt0T0GSeux73/set9xyCz/5yU/YbbfdeMMb3tB+f2C33XZrLx2tqs2OUVV885vf5OCDD35B+y233DKitbqnIEk99sQTTzBt2jR22203li9fzq233jrserNnz2bFihWsXLmSquKaa65pnzvuuOP43Oc+1y7/7Gc/A2Dq1Kk8+eSTI1aroSBJPfbWt76VNWvWMGfOHBYtWsRRRx017HoveclLuPjiizn22GM5+uij2XfffXnpS18KwEUXXcSaNWs47LDDmD17NgsXLgTgLW95Cz//+c85/PDDPdEsSWPVhn/aMDQx3Y033jjser/73e9esHzssceyYsUKqor3v//9zJs3D4Ddd9+dSy+9dJP+06dPH9EfG+vZnkKS/ZN8P8k9SZYnObdpX5jkwSR3NLfjO/p8OMm9SVYkOa5XtUnSWPWFL3yBuXPnMmvWLJ5++mne9773jer2e7mnsA74y6q6PclU4LYkNzXPfaaqPtm5cpJZwKnAbGBf4HtJXlVV63tYoySNKeeffz7nn39+37bfsz2Fqnq4qm5vHj8J3APst4UuJwJXV9UzVXUfcC8wv1f1SZI2NSonmpPMAA4Hfto0fSDJnUkuT/Lypm0/YGVHt0GGCZEkZydZmmTpqlWreli1JE08PQ+FJHsAXwM+VFW/B74AHAzMBR4GPrVh1WG6b3LRblVdUlXzqmrehommJEkjo6ehkGQyQ4Hw5ar6OkBVPVJV66vqD8ClPH+IaBDYv6P7APBQL+uTJL1Qz040Z+grel8E7qmqT3e071NVDzeLJwPLmsfXA19J8mmG
TjQfAizpVX2SNJJGesblbmYtvuGGGzj33HNZv349733ve7ngggt2eLu9vPro9cB7gLuS3NG0fQQ4Lclchg4N3Q+8H6Cqlie5FriboSuXzvHKI0ka3vr16znnnHO46aabGBgY4Mgjj+SEE05g1qxZOzRuz0Khqn7E8OcJvrOFPouBxb2qSZJ2FkuWLOGVr3wlBx10EACnnnoq11133Q6HgtNcSNI49OCDD7L//s+fhh0YGODBBx/c4XENBUkah4abUXXDbKs7wlCQpHFoYGCAlSuf/2rX4OAg++677w6PayhI0jh05JFH8stf/pL77ruPZ599lquvvpoTTjhhh8d1llRJGgHdXEI6knbZZRcuvvhijjvuONavX8+ZZ57J7Nmzd3zcEahNUg+M9HXv49lo/8MdL44//niOP/74ra+4DTx8JElqGQqSpJahIElqGQqSpJahIElqGQqSpJaXpErSCHhg0WEjOt4BF9611XXOPPNMvv3tb7P33nuzbNmyra7fDfcUJGmcOuOMM7jhhhtGdExDQZLGqTe+8Y1MmzZtRMc0FCRJLUNBktQyFCRJLUNBktTyklRJGgHdXEI60k477TR+8IMf8NhjjzEwMMDHPvYxzjrrrB0a01CQpHHqqquuGvExPXwkSWoZCpKklqEgSdupqvpdwhZtT32GgiRthylTprB69eoxGwxVxerVq5kyZco29fNEsyRth4GBAQYHB1m1alW/S9msKVOmMDAwsE19DAVJ2g6TJ09m5syZ/S5jxHn4SJLUMhQkSS1DQZLU6lkoJNk/yfeT3JNkeZJzm/ZpSW5K8svm/uUdfT6c5N4kK5Ic16vaJEnD6+WewjrgL6vqXwP/BjgnySzgAuDmqjoEuLlZpnnuVGA2sAD4fJJJPaxPkrSRnoVCVT1cVbc3j58E7gH2A04ErmxWuxI4qXl8InB1VT1TVfcB9wLze1WfJGlTo3JOIckM4HDgp8ArquphGAoOYO9mtf2AlR3dBpu2jcc6O8nSJEvH8vXBkjQe9TwUkuwBfA34UFX9fkurDtO2yVcFq+qSqppXVfOmT58+UmVKkuhxKCSZzFAgfLmqvt40P5Jkn+b5fYBHm/ZBYP+O7gPAQ72sT5L0Qr28+ijAF4F7qurTHU9dD5zePD4duK6j/dQkL04yEzgEWNKr+iRJm+rlNBevB94D3JXkjqbtI8AngGuTnAU8ALwLoKqWJ7kWuJuhK5fOqar1PaxPkrSRnoVCVf2I4c8TAByzmT6LgcW9qkmStGV+o1mS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEktQ0GS1DIUJEmtnoVCksuTPJpkWUfbwiQPJrmjuR3f8dyHk9ybZEWS43pVlyRp87oKhSQ3d9O2kSuABcO0f6aq5ja37zRjzQJOBWY3fT6fZFI3tUmSRs4WQyHJlCTTgL2SvDzJtOY2A9h3S32r6hbgt13WcSJwdVU9U1X3AfcC87vsK0kaIVvbU3g/cBvwr5r7DbfrgL/fzm1+IMmdzeGllzdt+wErO9YZbNo2keTsJEuTLF21atV2liBJGs4WQ6Gq/raqZgLnVdVBVTWzuc2pqou3Y3tfAA4G5gIPA59q2jPc5jdT0yVVNa+q5k2fPn07SpAkbc4u3axUVX+X5N8CMzr7VNWXtmVjVfXIhsdJLgW+3SwOAvt3rDoAPLQtY0uSdlxXoZDkfzL0Cf8OYH3TXMA2hUKSfarq4WbxZGDDlUnXA19J8mmGzlUcAizZlrElSTuuq1AA5gGzqmrYQzrDSXIV8CaGTlIPAhcBb0oyl6FAuZ+hcxZU1fIk1wJ3A+uAc6pq/XDjSpJ6p9tQWAb8S4bOA3Slqk4bpvmLW1h/MbC42/El
SSOv21DYC7g7yRLgmQ2NVXVCT6qSJPVFt6GwsJdFSJLGhm6vPvphrwuRJPVft1cfPcnz3xvYFZgM/HNV/YteFSZJGn3d7ilM7VxOchJOQyFplDyw6LB+lzBmHHDhXT0df7tmSa2qbwJvGeFaJEl91u3ho7d3LL6Ioe8tdP2dBUnS+NDt1Udv63i8jqEvnp044tVIkvqq23MK/6nXhUiS+q/bH9kZSPKN5pfUHknytSQDvS5OkjS6uj189A/AV4B3Ncvvbtr+XS+KUn94hcfzen2FhzRWdXv10fSq+oeqWtfcrgD8MQNJ2sl0GwqPJXl3kknN7d3A6l4WJkkafd2GwpnAKcBvGJop9Z2AJ58laSfT7TmFvwZOr6rHAZJMAz7JUFhIknYS3e4pvGZDIABU1W+Bw3tTkiSpX7oNhRclefmGhWZPodu9DEnSONHtP/ZPAf83yVcZmt7iFPyVNEna6XT7jeYvJVnK0CR4Ad5eVXf3tDJJ0qjr+hBQEwIGgSTtxLZr6mxJ0s7JUJAktQwFSVLLUJAktQwFSVLLUJAktQwFSVLLUJAktQwFSVLLUJAktXoWCkkuT/JokmUdbdOS3JTkl81958yrH05yb5IVSY7rVV2SpM3r5Z7CFcCCjdouAG6uqkOAm5tlkswCTgVmN30+n2RSD2uTJA2jZ6FQVbcAv92o+UTgyubxlcBJHe1XV9UzVXUfcC8wv1e1SZKGN9rnFF5RVQ8DNPd7N+37ASs71hts2jaR5OwkS5MsXbVqVU+LlaSJZqycaM4wbTXcilV1SVXNq6p506dP73FZkjSxjHYoPJJkH4Dm/tGmfRDYv2O9AeChUa5Nkia80Q6F64HTm8enA9d1tJ+a5MVJZgKHAEtGuTZJmvC6/uW1bZXkKuBNwF5JBoGLgE8A1yY5C3gAeBdAVS1Pci1Dv+y2Djinqtb3qjZJ0vB6FgpVddpmnjpmM+svBhb3qh5J0taNlRPNkqQxwFCQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLUMBUlSy1CQJLV26cdGk9wPPAmsB9ZV1bwk04BrgBnA/cApVfV4P+qTpImqn3sKb66quVU1r1m+ALi5qg4Bbm6WJUmjaCwdPjoRuLJ5fCVwUh9rkaQJqV+hUMB3k9yW5Oym7RVV9TBAc793n2qTpAmrL+cUgNdX1UNJ9gZuSvJP3XZsQuRsgAMOOKBX9UnShNSXPYWqeqi5fxT4BjAfeCTJPgDN/aOb6XtJVc2rqnnTp08frZIlaUIY9VBIsnuSqRseA38MLAOuB05vVjsduG60a5Okia4fh49eAXwjyYbtf6WqbkhyK3BtkrOAB4B39aE2SZrQRj0UqurXwJxh2lcDx4x2PZKk542lS1IlSX1mKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKllKEiSWoaCJKk15kIhyYIkK5Lcm+SCftcjSRPJmAqFJJOAvwf+PTALOC3JrP5WJUkTx5gKBWA+cG9V/bqqngWuBk7sc02SNGHs0u8CNrIfsLJjeRA4qnOFJGcDZzeLTyVZMUq17fQOhL2Ax/pdx5hwUfpdgTr4t9lhZP42D9zcE2MtFIZ7tfWChapLgEtGp5yJJcnSqprX7zqkjfm3OXrG2uGjQWD/juUB4KE+1SJJE85YC4VbgUOSzEyyK3AqcH2fa5KkCWNMHT6qqnVJPgDcCEwCLq+q5X0uayLxsJzGKv82R0mqautrSZImhLF2+EiS
1EeGgiSpZSjIqUU0ZiW5PMmjSZb1u5aJwlCY4JxaRGPcFcCCfhcxkRgKcmoRjVlVdQvw237XMZEYChpuapH9+lSLpD4zFLTVqUUkTRyGgpxaRFLLUJBTi0hqGQoTXFWtAzZMLXIPcK1Ti2isSHIV8I/AoUkGk5zV75p2dk5zIUlquacgSWoZCpKklqEgSWoZCpKklqEgSWoZCpKklqEgNZL8WZI/HeExr0jyzubxZdszA22ShUkqySs72v6iaZvXLH8nycu2cdwRf70a/8bUbzRL3UiyS/OluxFVVf9jpMfcaPz37kD3uxj6tvnHm+V3And3jH38dtTT09er8ck9BfVNkt2T/J8kP0+yLMmfJDkiyQ+T3JbkxiT7NOv+IMl/TfJD4NzOT+DN8081929q+l+b5BdJPpHkPyZZkuSuJAdvoZ6FSc7r2N5/b/r9IsnRTfvspu2OJHcmOSTJjM4fgUlyXpKFw4z/g45P9k8lWdy89p8kecVW3q5v0kxpnuQg4AlgVcfY9yfZa7j3tHn+E0nubmr+5Da83pc07+WdSa5J8tMNr0E7J0NB/bQAeKiq5lTVq4EbgL8D3llVRwCXA4s71n9ZVf1RVX1qK+POAc4FDgPeA7yqquYDlwEf3Ib6dmn6fQi4qGn7M+Bvq2ouMI+hCQW3x+7AT6pqDnAL8L6trP97YGWSVwOnAddsZr1N3tMk04CTgdlV9Rqe39vY2HCv9z8Djzf9/ho4oruXp/HKUFA/3QUc23xCPZqh2VpfDdyU5A7gowzN2rrB5v4RbuzWqnq4qp4BfgV8t2N7M7ahvq8397d19PtH4CNJ/go4sKqe3obxOj0LfHuY8bfkaoYOIZ0EfGMz67zgPa2qJxgKlLXAZUneDqzZTN/hXu8bmu1SVcuAO7uoU+OYoaC+qapfMPTJ8y7gvwHvAJZX1dzmdlhV/XFHl3/ueLyO5u83SYBdO557puPxHzqW/8C2nUfb0G/9hn5V9RXgBOBp4MYkb+mspTGli7Gfq+cnHmvH34pvMbTn80BV/X64FTZ+T5Nc2Jx/mQ98jaFAuWEz42/yehn+9za0EzMU1DdJ9gXWVNX/Aj4JHAVMT/K65vnJSWZvpvv9PH8o40Rgco/LpanpIODXVfU5hqYYfw3wCLB3kj2TvBj4D73YdrNX8le88JDaxvVt/J6+NskewEur6jsMHRqauw2b/RFwSjP2LIYOyWkn5tVH6qfDgL9J8gfgOeDPGfrU/bkkL2Xo7/OzwHBTeV8KXJdkCXAzL9yL6KU/Ad6d5DngN8CiqnouySLgp8B9wD/1auNVdfVWVhnuPZ3K0Hs1haFP/n+xDZv8PHBlkjuBnzF0+OiJbS5c44ZTZ0varCSTgMlVtba5cutmhk7cP9vn0tQj7ilI2pKXAN9PMpmhvYw/NxB2bu4paMJJ8l+Ad23U/L+rarPH6kfDWK1LE4uhIElqefWRJKllKEiSWoaCJKllKEiSWv8fbXvX8rz+ZW4AAAAASUVORK5CYII=\n",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "# Many values are missing, so add an indicator column marking whether each value is missing\n",
    "train['serum_insulin_Missing'] = train['serum_insulin'].isnull().astype(int)\n",
    "sns.countplot(x=\"serum_insulin_Missing\", hue=\"Target\", data=train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, whether this feature is missing does not appear to be related to the target either."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "train.drop([\"Triceps_skin_fold_thickness_Missing\", \"serum_insulin_Missing\"], axis=1, inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
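    "The missingness looks random, so the added indicator features are dropped and the missing values are simply imputed with each column's median."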
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pregnants                       0\n",
      "Plasma_glucose_concentration    0\n",
      "blood_pressure                  0\n",
      "Triceps_skin_fold_thickness     0\n",
      "serum_insulin                   0\n",
      "BMI                             0\n",
      "Diabetes_pedigree_function      0\n",
      "Age                             0\n",
      "Target                          0\n",
      "dtype: int64\n"
     ]
    }
   ],
   "source": [
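    "# Fill missing values in each column with that column's median\n",
    "medians = train.median()\n",
    "train = train.fillna(medians)\n",
    "\n",
    "# Confirm that no missing values remain\n",
    "print(train.isnull().sum())"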
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data standardization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
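    "# Separate the labels from the features\n",
    "y_train = train['Target']\n",
    "X_train = train.drop([\"Target\"], axis=1)\n",
    "\n",
    "# Keep the feature names for rebuilding the DataFrame after feature engineering\n",
    "feat_names = X_train.columns\n",
    "\n",
    "# Data standardization\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Initialize the standardizer for the features\n",
    "ss_X = StandardScaler()\n",
    "\n",
    "# Fit the scaler on the training features and standardize them (there is no separate test set here)\n",
    "X_train = ss_X.fit_transform(X_train)"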
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save the processed features to a file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
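    "# Rebuild a DataFrame from the standardized array, reusing the label index so rows stay aligned\n",
    "X_train = pd.DataFrame(columns = feat_names, data = X_train, index = y_train.index)\n",
    "\n",
    "train = pd.concat([X_train, y_train], axis = 1)\n",
    "\n",
    "# Save as CSV\n",
    "train.to_csv('FE_pima-indians-diabetes.csv',index = False,header=True)"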
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
    "train.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pregnants</th>\n",
       "      <th>Plasma_glucose_concentration</th>\n",
       "      <th>blood_pressure</th>\n",
       "      <th>Triceps_skin_fold_thickness</th>\n",
       "      <th>serum_insulin</th>\n",
       "      <th>BMI</th>\n",
       "      <th>Diabetes_pedigree_function</th>\n",
       "      <th>Age</th>\n",
       "      <th>Target</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>7.680000e+02</td>\n",
       "      <td>7.680000e+02</td>\n",
       "      <td>7.680000e+02</td>\n",
       "      <td>7.680000e+02</td>\n",
       "      <td>7.680000e+02</td>\n",
       "      <td>7.680000e+02</td>\n",
       "      <td>7.680000e+02</td>\n",
       "      <td>7.680000e+02</td>\n",
       "      <td>768.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>-6.476301e-17</td>\n",
       "      <td>4.625929e-18</td>\n",
       "      <td>5.782412e-18</td>\n",
       "      <td>-1.526557e-16</td>\n",
       "      <td>1.503427e-17</td>\n",
       "      <td>2.613650e-16</td>\n",
       "      <td>2.451743e-16</td>\n",
       "      <td>1.931325e-16</td>\n",
       "      <td>0.348958</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.000652e+00</td>\n",
       "      <td>1.000652e+00</td>\n",
       "      <td>1.000652e+00</td>\n",
       "      <td>1.000652e+00</td>\n",
       "      <td>1.000652e+00</td>\n",
       "      <td>1.000652e+00</td>\n",
       "      <td>1.000652e+00</td>\n",
       "      <td>1.000652e+00</td>\n",
       "      <td>0.476951</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-1.141852e+00</td>\n",
       "      <td>-2.552931e+00</td>\n",
       "      <td>-4.002619e+00</td>\n",
       "      <td>-2.516429e+00</td>\n",
       "      <td>-1.467353e+00</td>\n",
       "      <td>-2.074783e+00</td>\n",
       "      <td>-1.189553e+00</td>\n",
       "      <td>-1.041549e+00</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>-8.448851e-01</td>\n",
       "      <td>-7.201630e-01</td>\n",
       "      <td>-6.937615e-01</td>\n",
       "      <td>-4.675972e-01</td>\n",
       "      <td>-2.220849e-01</td>\n",
       "      <td>-7.212087e-01</td>\n",
       "      <td>-6.889685e-01</td>\n",
       "      <td>-7.862862e-01</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>-2.509521e-01</td>\n",
       "      <td>-1.530732e-01</td>\n",
       "      <td>-3.198993e-02</td>\n",
       "      <td>-1.230129e-02</td>\n",
       "      <td>-1.815412e-01</td>\n",
       "      <td>-2.258989e-02</td>\n",
       "      <td>-3.001282e-01</td>\n",
       "      <td>-3.608474e-01</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>6.399473e-01</td>\n",
       "      <td>6.112653e-01</td>\n",
       "      <td>6.297816e-01</td>\n",
       "      <td>3.291706e-01</td>\n",
       "      <td>-1.554775e-01</td>\n",
       "      <td>6.032562e-01</td>\n",
       "      <td>4.662269e-01</td>\n",
       "      <td>6.602056e-01</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>3.906578e+00</td>\n",
       "      <td>2.542658e+00</td>\n",
       "      <td>4.104082e+00</td>\n",
       "      <td>7.955377e+00</td>\n",
       "      <td>8.170442e+00</td>\n",
       "      <td>5.042397e+00</td>\n",
       "      <td>5.883565e+00</td>\n",
       "      <td>4.063716e+00</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          pregnants  Plasma_glucose_concentration  blood_pressure  \\\n",
       "count  7.680000e+02                  7.680000e+02    7.680000e+02   \n",
       "mean  -6.476301e-17                  4.625929e-18    5.782412e-18   \n",
       "std    1.000652e+00                  1.000652e+00    1.000652e+00   \n",
       "min   -1.141852e+00                 -2.552931e+00   -4.002619e+00   \n",
       "25%   -8.448851e-01                 -7.201630e-01   -6.937615e-01   \n",
       "50%   -2.509521e-01                 -1.530732e-01   -3.198993e-02   \n",
       "75%    6.399473e-01                  6.112653e-01    6.297816e-01   \n",
       "max    3.906578e+00                  2.542658e+00    4.104082e+00   \n",
       "\n",
       "       Triceps_skin_fold_thickness  serum_insulin           BMI  \\\n",
       "count                 7.680000e+02   7.680000e+02  7.680000e+02   \n",
       "mean                 -1.526557e-16   1.503427e-17  2.613650e-16   \n",
       "std                   1.000652e+00   1.000652e+00  1.000652e+00   \n",
       "min                  -2.516429e+00  -1.467353e+00 -2.074783e+00   \n",
       "25%                  -4.675972e-01  -2.220849e-01 -7.212087e-01   \n",
       "50%                  -1.230129e-02  -1.815412e-01 -2.258989e-02   \n",
       "75%                   3.291706e-01  -1.554775e-01  6.032562e-01   \n",
       "max                   7.955377e+00   8.170442e+00  5.042397e+00   \n",
       "\n",
       "       Diabetes_pedigree_function           Age      Target  \n",
       "count                7.680000e+02  7.680000e+02  768.000000  \n",
       "mean                 2.451743e-16  1.931325e-16    0.348958  \n",
       "std                  1.000652e+00  1.000652e+00    0.476951  \n",
       "min                 -1.189553e+00 -1.041549e+00    0.000000  \n",
       "25%                 -6.889685e-01 -7.862862e-01    0.000000  \n",
       "50%                 -3.001282e-01 -3.608474e-01    0.000000  \n",
       "75%                  4.662269e-01  6.602056e-01    1.000000  \n",
       "max                  5.883565e+00  4.063716e+00    1.000000  "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
